Abstract

We address two issues which arise in the task of detecting anomalous behavior in complex sys- 
tems with numerous sensor channels: how to adjust alarm thresholds dynamically, within the 
changing operating context of the system, and how to utilize sensors selectively, so that nominal 
operation can be verified reliably without processing a prohibitive amount of sensor data. Our 
approach involves simulation of a causal model of the system, which provides information on 
expected sensor values, and on dependencies between predicted events, useful in assessing the 
relative importance of events so that sensor resources can be allocated effectively. We discuss 
briefly the potential applicability of this work to the execution monitoring of robot task plans. 


The Monitoring Problem 

Timely detection of anomalous behavior is essential for the continuous safe operation and lon- 
gevity of aerospace systems. The pilot of a jet aircraft must be aware of any conditions which 
may affect thrust during the critical moments of takeoff. The thermal environment onboard 
Space Station Freedom must be carefully controlled to provide uninterrupted life support for 
the crew. The Mars Rover must react quickly to an unpredictable environment or the mission 
may come to an abrupt conclusion. 

The monitoring problem becomes more difficult when the behavior of a physical system in- 
volves interactions among components or interaction with an environment. Under these condi- 
tions, correct operation becomes context-dependent; it is not possible to determine a priori a 
set of sensor values which always imply nominal operation. Moreover, when the number of 
sensors in a physical system becomes very large, the ability to combine sensor data into a pic- 
ture of the global state of a system becomes compromised. Studies of plant catastrophes have 
revealed that information which might have been useful in preventing disaster was typically 
available but was not prominent enough within the overwhelming morass of data presented to 
operators. 

In this paper, we concentrate on the initial step in the monitoring process--detecting anoma- 
lous behavior quickly and reliably. We do not address here the equally important steps of 
tracking faulted behavior and determining control actions to continue operation in the presence 
of faults. Within this focus, we address two important issues: (1) how to adjust nominal sensor 
value expectations dynamically, taking into account the changing operating context of the sys- 
tem, and (2) how to utilize sensors selectively, determining which subset of the available sen- 
sors to use at any given time to verify nominal operation efficiently, without processing a pro- 
hibitive amount of data. 

Two Issues


The traditional approach to verifying the correct operation of a system being monitored involves 
associating alarm thresholds with sensors. Fixed threshold values for each sensor are deter- 
mined ahead of time by analyzing the designed nominal behavior for the system. Whenever a 
sensor value crosses a threshold during operation, an alarm is raised. 

The problem with this approach is that the nominal behavior of even moderately complex sys- 
tems often depends on context. For example, an earth-orbiting spacecraft periodically enters 
and emerges from the Earth’s shadow. Impingent solar radiation changes the thermal profile of 
the spacecraft, as does the configuration of currently active and, consequently, heat-generating
subsystems on board. Thresholds on temperature sensors should be adjusted accordingly. A 
particular temperature value may be indicative of a problem when the spacecraft is in shadow 
or mostly inactive, but may be within acceptable limits when the spacecraft is in sunlight or 
many on-board systems are operating. 

Fixed alarm thresholds are useful for defining the operating limits of a physical system, such as 
the point of overbalance of a rover, or the temperature at which, say, the onboard computer of a 
spacecraft is at risk. Nonetheless, they are woefully inadequate for verifying the nominal oper- 
ation of a system with many operating modes, or one which interacts with an environment. The 
problem is that fixed alarm thresholds are derived from an over-summarized model of the be- 
havior of a system. If the thresholds are chosen conservatively, then false alarms occur. If 
they are chosen boldly, then undetected anomalies occur. What is needed is a capability for ad- 
justing alarm thresholds dynamically. Alarm thresholds should be chosen according to expec- 
tations about the nominal behavior of a system as it changes in different operating contexts. 
Later on in this paper, we present our approach to dynamic alarm threshold adjustment based 
on causal simulation of the device. 
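
The trade-off between false alarms and undetected anomalies can be made concrete with a minimal sketch. The context names, temperature bands, and the reading below are illustrative assumptions, not values from any real spacecraft; the point is only that a single band wide enough to cover every context cannot flag a value that is anomalous in just one of them.

```python
# Hypothetical nominal temperature bands (deg C), keyed by operating context.
# All numbers are illustrative assumptions.
CONTEXT_LIMITS = {
    "shadow": (-20.0, 10.0),
    "sunlight": (5.0, 40.0),
}

# A single fixed band, chosen wide enough to cover both contexts.
FIXED_LIMITS = (-20.0, 40.0)


def in_band(value, limits):
    """Return True when value lies within the closed interval limits."""
    low, high = limits
    return low <= value <= high


def fixed_alarm(value):
    """Alarm decision using one context-independent band."""
    return not in_band(value, FIXED_LIMITS)


def dynamic_alarm(value, context):
    """Alarm decision using the band expected for the current context."""
    return not in_band(value, CONTEXT_LIMITS[context])


# 30 deg C is unremarkable in sunlight but anomalous in shadow.
reading = 30.0
print(fixed_alarm(reading))                # False: the wide band misses it
print(dynamic_alarm(reading, "shadow"))    # True
print(dynamic_alarm(reading, "sunlight"))  # False
```

Narrowing the fixed band to the shadow limits would catch this reading, but would then raise false alarms throughout the sunlit part of the orbit.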


Another issue which arises in monitoring concerns how to best utilize available sensors to effi- 
ciently and reliably, but not necessarily comprehensively, verify the nominal operation of a 
physical system. Just as the nominal values in a system being monitored depend on context, so
does the subset of sensors which can most directly verify those values. The familiar
activity of driving an automobile helps to illustrate this idea. A variety of sensors are
provided to the operator of an automobile: fuel gauge, temperature gauge, speedometer, several 
mirrors, etc. However, the driver does not use all of these diverse sensors all of the time. The 
speedometer may be checked periodically, or when a speed limit sign is passed; the right-side 
mirror is probably only used during lane changes. There are two points to be made: one con- 
cerns relevance, the other concerns resources. 


Individual sensors are appropriate for verifying only some small, localized subset of the possi- 
ble behavior of a system. The choice of which sensors to sample and interpret at any particular 
time should be based on expectations of what is to happen in the system and, perhaps, how it is 
to interact with an environment. However, even after a suitable subset of the available sensors 
is identified, there may not be the resources available, whether human or machine, to sample 
all the selected sensors and interpret the data within a required response frame. What is needed 
is a capability for assessing the importance of predicted events, so that while it may not be pos- 
sible to comprehensively verify the expected behavior of a system, still the most reliable veri- 
fication within available resources can be performed. 

An illustration of the need to focus attention in monitoring comes from the jet aircraft domain.
Some of the recent commercial aircraft catastrophes have been attributed to insufficient thrust
during the critical moments of takeoff. There are many possible indicators of low thrust
available to a flight crew. For example, a low exhaust gas temperature in an engine may produce
reduced thrust. Also, a low turbine fan rotation speed in an engine may imply reduced thrust,
because fuel input is based partly on this parameter. The problem is how to direct the attention of
the flight crew, without overwhelming them, towards information useful for planning
compensating actions in real time.

A monitoring strategy must take into account the reality that not all sensors should or can be 
checked all of the time. As the operating context of the physical system being monitored chang- 
es, the collection of sensors which provide the most immediate information on the state of the 
system also changes. Further on in this paper, we present our approach to sensor planning in 
monitoring. We describe a method for assessing the importance of predicted events in a system, 
based on reasoning about causal dependencies among events, and about how events relate to in- 
tended goals of the designers or operators of a system. 

Other Work 

Within NASA, there are other projects underway in which the goal is to develop a monitoring 
and a diagnosis capability for aerospace systems. Among these is the KATE project at the Kennedy
Space Center, whose domain is the Shuttle Liquid Oxygen Loading system [1]. In this project,
causal models are used to support sensor validation, fault diagnosis, and the planning of control 
actions. 

The goal of the FAULTFINDER project at Langley Research Center [2] is to develop an inflight
monitoring and diagnosis capability for jet aircraft. These investigators have explored the use of
multiple representations and multiple levels of abstraction to be able to reason about diverse 
faults, to focus attention during reasoning, and to provide accessible information to a flight 
crew. 

Numerous other examples exist of efforts to develop monitoring and control systems. The read- 
er is referred to Dvorak’s excellent survey of the area [3]. 

The causal reasoning paradigm, which is at the core of our approach to the monitoring problem, 
is now a well-established area of investigation within Artificial Intelligence. The advantages of 
the causal approach, which involves modeling a system at the level of components and mecha- 
nisms, include the ability to reason about unforeseen interactions, the ability to reason about 
dependencies among events, and the ability to generate accessible explanations. The seminal ef- 
forts in this area include Forbus’ process-centered approach [4], de Kleer and Brown’s device- 
centered approach [5], and Kuipers’ qualitative mathematics approach [6]. 

In the specific area of monitoring, Dvorak’s MIMIC project stands out as the most comprehensive
current research effort [7]. Dvorak creates a component-connection model of a system and
employs the QSIM qualitative simulator [6] to generate expectations about the system’s nominal
behavior. An inductive learning method is used to create a set of symptom-fault rules for 
known faults, and these rules support the formation of fault hypotheses whenever sensor data 
does not match predictions from the causal model. When anomalous behavior exists, several 
fault models can be tracked in parallel until one emerges as the hypothesis with the most
explanatory power. The ability to continue tracking a faulted system is important because large,
complex systems almost always contain faults and the challenge is to maintain safe operation in 
the presence of faults. 

The Approach 

At the center of our approach to addressing the two issues of dynamic alarm thresholds and
sensor selection is a causal model of the system being monitored and, possibly, its environment.
Simulation of this model directly solves the problem of alarm threshold adjustment. Predicted 
values and their time tags indicate how and when to alter the alarm thresholds associated with
sensors so that they reflect expectations about the nominal operation of the system in changing
contexts.
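
One way to turn time-tagged predictions into alarm thresholds is sketched below. This is an illustration under stated assumptions, not the mechanism used in our system: the predicted values, the fixed tolerance, and the piecewise-constant treatment of expectations between time tags are all hypothetical choices made for the example.

```python
from bisect import bisect_right

# Hypothetical predicted events for one sensor: (moment, expected value).
# In the approach described here they would come from causal simulation;
# these numbers are made up for illustration.
PREDICTIONS = [(0, 20.0), (60, 15.0), (70, 12.0)]
TOLERANCE = 2.0  # assumed half-width of the acceptable band


def threshold_at(moment, predictions=PREDICTIONS, tolerance=TOLERANCE):
    """Return the (low, high) alarm band in force at the given moment.

    The band is centered on the most recent predicted value whose time
    tag is not later than the moment (a simplifying assumption).
    """
    moments = [m for m, _ in predictions]
    index = bisect_right(moments, moment) - 1
    if index < 0:
        raise ValueError("no prediction available at or before this moment")
    expected = predictions[index][1]
    return (expected - tolerance, expected + tolerance)


def is_anomalous(moment, reading):
    """Compare a sensor reading with the band in force at that moment."""
    low, high = threshold_at(moment)
    return not (low <= reading <= high)


# The same reading is nominal early on and anomalous after the
# predicted change at moment 60.
print(is_anomalous(10, 19.0))
print(is_anomalous(65, 19.0))
```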

Another result of simulation is information about causal dependencies among predicted events of 
a system. This information is used to assess the importance of individual events. Briefly, the 
most important events are taken to be those which either cause or are caused by the greatest 
number of other events. An ordering on predicted events reflecting this causal notion of impor- 
tance serves as the basis for allocating sensor resources to selectively verify the expected be- 
havior of a system [8]. 

In the remainder of this section, we describe (1) the architecture of our predictive monitoring
system, called PREMON, (2) what our causal models of physical systems look like, and how they
are simulated, and finally, (3) our approach to sensor planning, based on analyzing causal
dependencies.

Architecture 

There are three modules in the PREMON system: a causal simulator, a sensor planner, and a
sensor interpreter. See Figure 1.


The causal simulator takes as input a causal model of the system to be monitored, and a set of 
events describing the initial state of the system and perhaps some future scheduled events. The 
causal simulator produces as output a set of predicted events, and a graph of causal dependencies 
among those events. 

The sensor planner takes as input the causal dependency graph generated by the causal simulator 
and determines which subset of the predicted events should be verified. These events are passed 
on to the sensor interpreter. 


The sensor interpreter compares expected values as predicted by the causal simulator with ac- 
tual values from sensors. Alarms are raised here when there are discrepancies. Finally, the 
most recent sensed data is passed back to the causal simulator to contribute to another predict- 
plan-sense cycle of monitoring. 

Figure 1. Architecture of PREMON.
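
The data flow among the three modules can be sketched as a single predict-plan-sense cycle. The function names, signatures, and the scalar tolerance below are hypothetical and chosen only to illustrate the flow; the simulator, planner, and sensor access are passed in as plain callables and are stubbed out in the usage example.

```python
def monitor_cycle(simulate, plan, read_sensor, state, budget, tolerance=1.0):
    """Run one predict-plan-sense cycle and return (alarms, new_state).

    simulate(state) -> (predictions, dependencies)
        predictions maps a sensor name to its expected value.
    plan(predictions, dependencies, budget) -> sensor names to verify.
    read_sensor(name) -> the actual value for that sensor.
    """
    # Predict: expected values plus causal dependencies among them.
    predictions, dependencies = simulate(state)
    # Plan: choose which predictions to verify within the budget.
    selected = plan(predictions, dependencies, budget)
    # Sense: compare expectations with actual readings.
    alarms = []
    new_state = dict(state)
    for name in selected:
        actual = read_sensor(name)
        if abs(actual - predictions[name]) > tolerance:
            alarms.append((name, predictions[name], actual))
        new_state[name] = actual  # sensed data feeds the next cycle
    return alarms, new_state


# Usage with trivial stand-ins for the three modules (all hypothetical).
def stub_simulate(state):
    return {"fan": 1.0, "heater": 0.0, "mirror": -5.0}, {}


def stub_plan(predictions, dependencies, budget):
    return sorted(predictions)[:budget]


readings = {"fan": 1.2, "heater": 3.0, "mirror": -5.0}
alarms, state = monitor_cycle(stub_simulate, stub_plan,
                              readings.__getitem__, {}, budget=2)
print(alarms)  # [('heater', 0.0, 3.0)]
```

Passing the modules in as callables keeps the cycle itself independent of how prediction and sensor selection are actually carried out.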

Causal Models and Causal Simulation

We represent physical systems as a collection of quantities and mechanisms. Quantities are 
continuous parameters such as temperature, position, and amount-of-stuff. Quantities are 
specified by a physical object, a type, and an order. Examples of quantities are {HEATER
TEMPERATURE RATE} and {SWITCH POSITION AMOUNT}.

Events are discontinuous changes in the value of a quantity. Events are specified by a quantity, a 
value, and a moment. Examples of events are {HEATER TEMPERATURE RATE POSITIVE 61} and
{VALVE-17 POSITION AMOUNT OPEN 0}.

Mechanisms capture causal relations between quantities. More specifically, they describe how a 
change in one quantity results in a change in another quantity. Examples of mechanisms are 
heat flow, thermal expansion, latch, and gravity. A mechanism is specified by a time constant, a 
distance, a sign, an efficiency, a bias, an alignment, and a medium. Figure 3 shows the repre- 
sentation of a heat flow mechanism. 

A causal model, then, consists of a set of quantities and a set of mechanisms between those
quantities. A causal model can be represented by a graph where the nodes are quantities and the arcs
are mechanisms. Simulation of a causal model involves predicting new events, via mechanisms, 
from known or previously predicted events. The simulation method outlined in the next few 
paragraphs is described more fully in [9]. 

When the quantity named in an event appears as the cause quantity in a mechanism, a new event 
is predicted as follows: (1) the quantity of the new event is the effect quantity of the mecha- 
nism, (2) the value of the new event is computed from the value of the given event and the sign 
and efficiency of the mechanism, (3) the moment of the new event is computed from the moment 
of the given event and the time constant and distance of the mechanism, and (4) the new event 
occurs only when constraints specified in the bias, alignment, and medium of the mechanism are 
satisfied. The bias of a mechanism specifies constraints on directions of change. For example, 
current through a wire can cause it to heat up, but not to cool down. The alignment of a mechanism
specifies constraints expressed as inequalities. For example, heat flow is from the warmer
to the cooler site. The medium of a mechanism is a physical connection such as a wire, a
pipe, a linkage, etc. The predicted effect occurs only when the specified physical connection is
in place.


A typical event, this one describing a temperature change, is shown in Figure 2. The heat flow 
mechanism in Figure 3 is used to predict another temperature change event, shown in Figure 4. 


QUANTITY        Chiller Temperature Rate
VALUE           Negative
MOMENT          60

Figure 2. A cause event.

TIME CONSTANT   1.0
DISTANCE        10.0
SIGN            +
EFFICIENCY      0.95
BIAS
ALIGNMENT
MEDIUM          {Chiller Pipe-4 Mirror}

Figure 3. A mechanism.

QUANTITY        Mirror Temperature Rate
VALUE           Negative
MOMENT          70

Figure 4. An effect event.
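
The prediction rule described above can be sketched as follows, using the values shown in Figures 2 to 4. Several details are assumptions made for the sketch rather than statements about our simulator: the delay is taken to be the product of time constant and distance (which is consistent with the moments 60 and 70 in the figures), the bias is checked against the direction of the effect, the efficiency is carried but not applied because the values here are qualitative, and the alignment constraint is omitted. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    quantity: str
    value: str      # qualitative: "Positive", "Negative", or "Zero"
    moment: float


@dataclass(frozen=True)
class Mechanism:
    cause: str                    # cause quantity
    effect: str                   # effect quantity
    time_constant: float
    distance: float
    sign: int                     # +1 or -1
    efficiency: float             # kept for completeness; unused below
    bias: Optional[str] = None    # if set, the only allowed effect direction
    medium_connected: bool = True


FLIP = {"Positive": "Negative", "Negative": "Positive", "Zero": "Zero"}


def predict(event, mech):
    """Return the effect event predicted from event via mech, or None."""
    if event.quantity != mech.cause:
        return None
    if not mech.medium_connected:
        return None  # the physical connection is not in place
    value = event.value if mech.sign > 0 else FLIP[event.value]
    if mech.bias is not None and value != mech.bias:
        return None  # change in this direction is not propagated
    # Assumed delay rule: time constant multiplied by distance.
    moment = event.moment + mech.time_constant * mech.distance
    return Event(mech.effect, value, moment)


cause = Event("Chiller Temperature Rate", "Negative", 60)
heat_flow = Mechanism(
    cause="Chiller Temperature Rate",
    effect="Mirror Temperature Rate",
    time_constant=1.0,
    distance=10.0,
    sign=+1,
    efficiency=0.95,
)
print(predict(cause, heat_flow))
```

With these inputs the sketch yields a Negative change in Mirror Temperature Rate at moment 70, matching Figure 4.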

Simulation would be straightforward if physical systems could be modeled exclusively as simple 
mechanism chains between input and output quantities. However, some mechanisms serve to 
enable or disable other mechanisms, such as a valve controlling a fluid flow, or a latch inhibit- 
ing the transmission of motion through a mechanical coupling. In these cases, the contributions 
of the separate mechanisms combine multiplicatively. The value contributed by the enabling or 
disabling mechanism can be discrete, as in the case of a switch, or continuous, as in the case of a 
valve. The contributions of separate mechanisms also can combine additively, as when two fluid 
lines empty into the same container, or two opposed forces produce an equilibrium state. 
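
The two ways of combining contributions can be illustrated with a small numerical sketch. The function name and the choice to sum first and then scale are assumptions made for the example; the numbers are arbitrary.

```python
def combine(additive_contributions, enabling_factors=()):
    """Sum the additive contributions, then scale by each enabling factor.

    An enabling factor of 0 disables the effect entirely (a switch);
    a factor between 0 and 1 attenuates it (a partly open valve).
    """
    total = sum(additive_contributions)
    for factor in enabling_factors:
        total *= factor
    return total


# Multiplicative: a flow of 2.0 through a half-open valve, then a closed one.
print(combine([2.0], [0.5]))   # 1.0
print(combine([2.0], [0.0]))   # 0.0

# Additive: two lines emptying into one container, then two opposed forces.
print(combine([1.5, 2.5]))     # 4.0
print(combine([3.0, -3.0]))    # 0.0
```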

Sensor Planning 

The output of the causal simulator is a trace of predicted events and the dependencies among 
them. The dependencies are derived from the mechanism structure of the system. A dependency 
between two events is a record that there is a mechanism in the system which causally relates 
the events. 

Analysis of the causal dependencies in a simulation trace supports decisions about which events 
to monitor. In our approach, the importance of events is assessed by determining how many 
other events are effects or causes of a given event. In other words, the importance of an event is 
related to the amount of subsequent activity it supports and the amount of activity which sup- 
ports its occurrence. Critical events which lie on several causal paths between inputs and out- 
puts should be verified with care, perhaps with a battery of sensors. On the other hand, events 
which are side effects and do not support further activity in the system may be ignored
completely. See Figure 5.



Figure 5. Assessing the importance of events. 
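
The importance measure can be sketched as a count over the dependency graph. The sketch below assumes that indirect as well as direct causes and effects are counted, and that the dependency graph is acyclic; the event names and the example graph are hypothetical and are not the graph in Figure 5.

```python
from collections import defaultdict


def reachable(start, adjacency):
    """Return every node reachable from start by following adjacency."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def importance(dependencies):
    """Score each event by its number of effects plus its number of causes.

    dependencies is an iterable of (cause_event, effect_event) pairs.
    """
    forward = defaultdict(set)
    backward = defaultdict(set)
    events = set()
    for cause, effect in dependencies:
        forward[cause].add(effect)
        backward[effect].add(cause)
        events.update((cause, effect))
    return {
        event: len(reachable(event, forward)) + len(reachable(event, backward))
        for event in events
    }


# Hypothetical trace: a and b are inputs, c lies on every path,
# e is a side effect that supports nothing further.
deps = [("a", "c"), ("b", "c"), ("c", "d"), ("c", "e"), ("d", "f")]
scores = importance(deps)
ranked = sorted(scores, key=lambda event: (-scores[event], event))
print(ranked)
```

In this example the event c, which lies on every path from inputs to outputs, receives the highest score, and the side effect e receives the lowest. A sensor planner with a limited budget could verify events in this order until resources run out.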

This analysis method weights all dependencies in a causal graph equally. Several criteria might 
form the basis of a non-uniform weighting scheme. For example, a priori or empirical knowl- 
edge about probabilities of failure might bias the allocation of sensor resources towards those 
components in a system known to be unreliable. Similarly, parts of a system where redundancy 
has been built in might be given less careful attention than other parts. 

Our causal analysis method for determining what subset of predicted events to monitor is simi- 
lar to the minimum entropy method of de Kleer and Williams [10] for determining the site of 
the most useful next measurement in troubleshooting. Their technique involves propagating ob- 
served values and failure probabilities along a causal dependency graph for a circuit. 

An Example: The JPL Space Simulator 

The JPL Space Simulator is an environmental chamber in which spacecraft and instruments can 
be subjected to some of the aspects of the space environment: intense cold, near vacuum, and 
solar radiation. 

A mirror is used to direct simulated solar radiation onto the spacecraft or instrument inside the 
chamber. This mirror must be cooled separately from the shroud which surrounds the chamber 
to compensate for the additional radiation falling on it. Cold gaseous nitrogen is used as the cool- 
ing medium and is circulated by a fan. Chilling is achieved by injecting liquid nitrogen into the 
gaseous nitrogen. Warming is achieved through an electrical heater. A causal simulation of this 
cooling circuit is shown in Figure 6. 

Using the causal analysis technique outlined above, the flow of gaseous nitrogen at the fan is 
identified as the single most critical event in the predicted nominal behavior of the circuit. This 
event affects gas flow around the entire circuit and indirectly, heat flow around the entire cir- 
cuit. The only events unaffected by this event are the source temperature changes at the chiller 
and heater. This result of causal analysis captures the intuitive notion that nothing at all hap- 
pens in the cooling circuit if the fan stops operating. Other important events in the predicted 
operation of the circuit are the temperature changes at the chiller and heater. Measurements 
made at these sites also provide informative feedback about the nominal operation of the circuit.


[Figure 6: causal graph with nodes including Fan Pressure, Coils Temp, and LN2 Temp. Arc labels: P: Pump, GF: Gas Flow, HT: Heat Transport, HF: Heat Flow.]

Figure 6. The mirror cooling circuit

This example has been implemented and illustrates our current causal simulation and sensor 
planning capabilities. We are beginning to apply our developing predictive monitoring capabil- 
ity to other aerospace systems existing or being designed within NASA. One of these is the Mars 
Rover. We are looking at the monitoring problems associated with terrain traversal, power 
distribution, and thermal distribution. In addition, we are looking at telecommunications sys- 
tems used in sending commands to and receiving telemetry from spacecraft, such as antenna 
control systems in the Deep Space Network. Finally, we have looked also at some of the smaller 
earth-orbiting spacecraft. 

Application to Robot Task Plan Execution Monitoring

The task of monitoring the behavior of a physical system bears similarities to the task of
monitoring the execution of a robot task plan [11,12,13]. In fact, the roots of the work described in
this paper are in an earlier effort which addressed the generation of expectations and perception 
requests to verify the execution of robot plans [14]. The two issues which form the focus of 
this paper apply also to the monitoring of robot plans: (1) How to predict the sensor values 
which imply correct execution of a plan, and (2) How to generate predictions and interpret 
sensor data selectively to meet real-time constraints with limited computational resources. 

The sensor values which indicate nominal execution of a plan vary. For example, the force 
readings which verify that a gripping action has been performed successfully depend on the grip 
configuration and the properties of the gripped object. The appearance of an object for recogni- 
tion by a vision system depends on its reflectance properties and on lighting. Physical models 
can be used to derive these context-dependent nominal sensor values. Some of these models 
might be causal models describing physical processes in the environment, including the inter- 
actions of the robot with the environment. 

Moreover, the importance of the individual actions in a plan may differ, and sensor
interpretation resources should be allocated accordingly. For example, the gripping of a tool prior to a
sequence of manipulations should be verified carefully, perhaps with a battery of sensors. 
Conversely, a gross motion prior to a gripping action may be verified more cursorily. The ap- 
proach to sensor planning based on causal dependency analysis developed for the physical device 
domain maps directly to the robot task planning domain. The dependency graph in Figure 5 
might describe the logical dependencies among the preconditions and consequences of actions in a 
task plan as easily as the causal dependencies among events in the operation of a physical device. 
As described above, the analysis method distinguishes critical actions from actions whose conse- 
quences are only side effects. Other criteria may also be used to assess the need to verify indi- 
vidual actions in a plan. For example, a motion which makes use of compliance may be assumed 
to be more robust and require less exact verification. 


Other issues which arise in the execution monitoring of robot task plans include: How to deal 
with uncertainty in the world model and in the operation of the robot? How to infer nominal 
execution when direct sensing is not possible? At what point(s) should a condition be verified 
when there is a delay between its establishment by one action and its enabling of another action?
Which sensors to read when several are relevant and how to fuse data from disparate sensors? 
What are the interactions of task planning with sensor planning? 


Conclusions 

Detecting anomalies in the operation of a system is a difficult problem when the behavior of the 
system is complex or involves interaction with an environment, and when the number of sensor 
channels is large. Under these conditions, nominal values and the most informative sensor data 
change according to context. We have addressed two specific issues in monitoring: how to adjust 
alarm thresholds dynamically, and how to verify behavior selectively but reliably. At the cen- 
ter of our approach to solving both problems is the use of a causal model of the system being 
monitored. Simulation of a causal model serves both to generate expectations about nominal 
sensor values, and to provide dependency information useful in assessing the importance of pre- 
dicted events and in allocating sensor resources accordingly. Some aspects of this approach ap- 
pear to be applicable to the task of monitoring the execution of robot plans. 

The key idea in this paper is letting go of the notion of comprehensive monitoring. More likely 
than not, there will be insufficient resources for predicting behavior and interpreting sensor 
data. In the face of this limitation, our emphasis is on verifying the operation of a system effi- 
ciently and reliably, by carefully focusing computational resources to gather the most informa- 
tive, if incomplete, feedback on nominal operation within changing contexts. 